STRUCTURAL ALIGNMENTS OF PSEUDO-KNOTTED
RNA-MOLECULES IN POLYNOMIAL TIME

MICHAEL BRINKMEIER

Abstract. An RNA molecule is structured on several layers. The primary and most obvious structure is its sequence of bases, i.e. a word over the alphabet {A, C, G, U}. The higher structure is a set of one-to-one base-pairings resulting in a two-dimensional folding of the one-dimensional sequence. One speaks of a secondary structure if these pairings do not cross and of a tertiary structure otherwise.

Since the folding of the molecule is important for its function, the search for related RNA molecules should not be restricted to the primary structure. It seems sensible to incorporate the higher structures in the search. Based on this assumption and certain edit-operations a distance between two arbitrary structures can be defined. It is known that the general calculation of this measure is NP-complete [ZWM02]. But for some special cases polynomial-time algorithms are known. Using a new formal description of secondary and tertiary structures, we extend the class of structures for which the distance can be calculated in polynomial time. In addition the presented algorithm may be used to approximate the edit-distance between two arbitrary structures with a constant ratio.

1. Introduction

Ribonucleic acid (RNA) is structured on three levels. The primary and most obvious structure is the underlying sequence of bases. The higher layers of structure are given by its folding, i.e. its pattern of base pairings. As long as the structure is nested, one speaks of a secondary structure. If crossed pairs or pseudoknots exist, the molecule is of tertiary structure¹. Since the folding and the embedding into three-dimensional space are important for the functional properties of an RNA molecule, it is of some interest to compare different molecules based on the secondary and tertiary structure and not only on the primary structure.

Restricted to the primary structure, the comparison of two or more RNA strands is efficiently solvable by the same techniques used for the alignment of DNA sequences [Gus97]. For a given set of (weighted) edit operations, the edit-distance, i.e. the minimal number (or total weight) of operations needed to transform one sequence into the other, is calculated. This results in an alignment of the two sequences.
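As an illustration of the unstructured case, the following is a minimal sketch of the classic dynamic program for the weighted edit distance of two plain sequences. The function name `edit_distance` and the cost parameters `sub_cost` and `indel_cost` are hypothetical choices made for this example and do not appear in the cited literature.

```python
def edit_distance(a, b, sub_cost=1, indel_cost=1):
    """Weighted edit distance between two plain sequences (no pairings).

    sub_cost and indel_cost are illustrative, uniform weights (assumptions).
    """
    # prev[j] = distance between the empty prefix of a and b[:j]
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel_cost]  # distance between a[:i] and the empty prefix
        for j in range(1, len(b) + 1):
            # match or replacement of a[i-1] by b[j-1]
            diag = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else sub_cost)
            # deletion of a[i-1] or insertion of b[j-1]
            cur.append(min(diag, prev[j] + indel_cost, cur[j - 1] + indel_cost))
        prev = cur
    return prev[-1]
```

For two sequences of lengths n and m this sketch uses O(nm) time and O(m) space; the structured case treated in this paper is considerably harder.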

In the literature this approach is transferred to higher structures of RNA molecules in various ways. Some of them rely on the tree representation of secondary structures and measure the tree-edit-distance (e.g. [SZ90], [Zha96a], [Zha96b]), some involve stochastic context-free grammars [SBH+94a], [SBH+94b]. Additional methods can be found in [BMR95] and [LRV98]. But most of them, mostly due to formal restrictions, cover only nested structures.

Key words and phrases. RNA secondary structure, alignment, edit distance.
¹ There exist two definitions of tertiary structure, one as given above, and an alternative definition as the spatial arrangement of bases.

Figure 1. Complexities of alignment algorithms

As K. Zhang et al. do in [ZWM02] and [JLMZ02], we treat unpaired bases and base-pairings as atomic units. There exist several variations of this approach. First of all, one has to choose a set of basic edit-operations. Secondly, one has to choose the scores or weights for these operations, which may depend on the underlying bases. Since we view unpaired bases and pairings as unbreakable units, it is only allowed to replace an unpaired base or pairing by another one, to remove it or to introduce a new one. It is not allowed to remove only one partner of a pairing, to break a pairing without removing the bases, etc.

These edit-operations are quite restrictive compared to the more flexible set used in [JLMZ02]. Nonetheless, the general problem of finding a minimal alignment (sequence of edit-operations) is NP-complete, as long as the edit-operations rely on the higher structures, i.e. the pairings (this is known for several choices of edit-operations [ZWM02], [JLMZ02]). So far polynomial algorithms are only known for alignments of a secondary structure with an arbitrary one².

We are going to present a new description of higher structures of RNA, using a formal system closely related to graph grammars. In the next section the formalism is defined and a set of generators, i.e. certain small structures which are used to construct larger, so-called decomposable structures, is presented. As we will see, this description includes nested structures and a wide variety of pseudoknots, but not all. Following that, we describe our edit-operations and the resulting type of alignment of two tertiary structures. Connecting this with our formalism leads to the essential observation that the alignments of decomposable structures are of a special type, called semi-decomposable, meaning that a core structure of the alignment, consisting of all matched or mismatched pairings and bases, is still decomposable. This results in the framework of an algorithm calculating the exact score of a minimal alignment of a decomposable structure with an arbitrary one. This algorithm runs in time and space polynomial in the number of bases of the aligned sequences. The degrees of the polynomials depend on the choice of generators. In fact it is possible to extend the set of decomposable structures easily by the introduction of additional generators. This increases the degree of the polynomials, but the required time and space remain polynomial, even though the general problem is NP-complete.

Finally we will give some results about the approximation ratio of the algorithm 
for arbitrary pairs of RNA molecules, depending on the chosen scores. 



² The algorithm of Zhang can be adapted to cover certain restricted H-like pseudoknots instead of nested structures, but the details are not given in the cited paper.

Unfortunately the runtime and especially the space requirements prohibit an actual implementation of this algorithm. Nonetheless, it may be the foundation of a family of more efficient algorithms. Ideas on how this basic algorithm can be improved are given in the last section of this paper.

2. Higher RNA Structures 

Usually an RNA molecule is represented by a sequence of bases, i.e. a word $s$ over the alphabet $\Sigma_{RNA} = \{A, C, G, U\}$, together with a set of pairings, i.e. a set of pairs $(i,j)$ with $1 \le i < j \le |s|$. Graphically these are represented by a structure (multi-)graph with vertex set $\{1, \dots, |s|\}$. The $i$-th base corresponds to the vertex $i$, and two subsequent vertices/bases are connected by a backbone(-edge). In addition each pairing $(i,j)$ is represented by an edge between $i$ and $j$.

We require an additional type of RNA-structures, introducing a gap between two bases, as already used by Rivas and Eddy in [RE99]. These will be called gapped or 1-structures (an example is given in Fig. 2). In addition we are going to separate the structure from the sequence of bases, providing us with a simplified notation.

For $n \in \mathbb{N}$ with $n \ge 1$ let $\underline{n}$ denote the finite set $\{1, \dots, n\} \subset \mathbb{N}$, and let $\underline{0}$ denote the empty set.

Definition 2.1 (0- and 1-Structures). A 0-structure $\sigma = (n, P)$ (or structure of type 0) consists of a natural number $n$ and a set $P \subseteq \underline{n} \times \underline{n}$ of pairs $(i,j)$ with $i < j$, such that for $(i,j), (i',j') \in P$ the intersection $\{i,j\} \cap \{i',j'\}$ is either empty or $\{i,j\} = \{i',j'\}$.

A 1-structure $\sigma = (n, P, k)$ (or structure of type 1) consists of a 0-structure $(n, P)$ and a natural number $0 \le k \le n$.

The elements of $\underline{n}$ are the bases and the pairs $(i,j) \in P$ the (base-)pairings. A base $i \in \underline{n}$ of $\sigma$ is paired if there exists a pairing $(i,j)$ or $(j,i)$, and unpaired otherwise. Furthermore each base is paired with at most one other base. The pairings and unpaired bases are also called structural elements. In a 1-structure $\sigma = (n, P, k)$ the sequence of bases is split after base $k$ into two intervals $[1,k]$ and $[k+1,n]$ called legs.

We explicitly allow the legs to be empty (by setting $k = 0$ or $k = n$). This allows us to view 0-structures as 1-structures with one empty leg. Therefore we may use these cases to relate to 0- and 1-structures without mentioning them explicitly.
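The definitions above can be mirrored by a small data type. The following is a sketch only; the class name `Structure` and its methods are hypothetical, and the convention of using `k=None` for a 0-structure is an assumption of this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Structure:
    n: int                                # number of bases, numbered 1..n
    pairings: FrozenSet[Tuple[int, int]]  # pairs (i, j) with i < j
    k: Optional[int] = None               # None: 0-structure; else gap after base k

    def is_valid(self) -> bool:
        used = set()
        for i, j in self.pairings:
            if not (1 <= i < j <= self.n):
                return False
            if i in used or j in used:    # a base may be paired at most once
                return False
            used.update((i, j))
        return self.k is None or 0 <= self.k <= self.n

    def unpaired(self):
        paired = {b for p in self.pairings for b in p}
        return [b for b in range(1, self.n + 1) if b not in paired]

    def legs(self):
        if self.k is None:
            raise ValueError("legs are only defined for 1-structures")
        return (1, self.k), (self.k + 1, self.n)  # an empty leg has hi = lo - 1


# The gapped structure of Fig. 2: 8 bases, four pairings, gap between 4 and 5.
fig2 = Structure(8, frozenset({(1, 6), (2, 5), (3, 8), (4, 7)}), 4)
```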




Figure 2. Examples of RNA structures: an RNA structure with 8 bases and pairings {(1, 6), (2, 5), (3, 8), (4, 7)}, and the same structure with a gap between bases 4 and 5.

Figure 3. The trivial structures $id_0$, $id_1$, $empty_0$ and $empty_1$

Figure 4. Independent, nested and crossing pairings



Since all pairings $(i,j)$ satisfy $i < j$, we do not need to differentiate between $(i,j)$ and $(j,i)$. If a pairing $(i,j)$ is given we assume $i < j$, unless stated otherwise. Furthermore we simply write $(i,j) \in \sigma = (n, P)$ for pairings in $P$ and $i \in \sigma$ for unpaired bases.

There exist four trivial structures containing at most one structural element. The identities simply consist of a free base or a single pairing. The empty structures are structures with no bases. In the graphical representation (see Fig. 3) we use an empty circle "∘" for empty legs.

Two pairings $(i,j)$ and $(i',j')$ with $i < i'$ are called independent if $i < j < i' < j'$, nested if $i < i' < j' < j$ and crossed if $i < i' < j < j'$. A 0- or 1-structure $\sigma$ is called nested if any two pairings are either independent or nested (see Fig. 4). A structure containing crossed pairings is called a pseudoknot or pseudoknotted.
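The three cases can be checked mechanically. The following sketch (the function names `relation` and `is_nested` are hypothetical) assumes that the two pairings share no base, as required by Definition 2.1.

```python
def relation(p, q):
    """Classify two pairings with disjoint end points."""
    (i, j), (i2, j2) = sorted([p, q])  # afterwards i < i2
    if j < i2:
        return "independent"           # i < j < i' < j'
    if j2 < j:
        return "nested"                # i < i' < j' < j
    return "crossed"                   # i < i' < j < j'


def is_nested(pairings):
    """True if no two pairings of the structure cross."""
    ps = sorted(pairings)
    return all(relation(p, q) != "crossed"
               for a, p in enumerate(ps) for q in ps[a + 1:])
```

For the structure of Fig. 2 the pairings (1, 6) and (3, 8) cross, so it is pseudoknotted.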

The natural numbers induce a total order on the structural elements. More precisely we define

Definition 2.2 (The Order of Structural Elements). Let $\sigma$ be a structure. The relation $<_\sigma$ on the set of structural elements is given by:

• $i <_\sigma j$ if and only if $i < j$.

• $i <_\sigma (i',j')$ if and only if $i < i'$.

• $(i',j') <_\sigma i$ if and only if $i' < i$.

• $(i,j) <_\sigma (i',j')$ if and only if $i < i'$.



2.1. Structures and Sequences. Up to this point we only described the structure of RNA-molecules, but not the sequence of bases. In fact we put some effort into the separation of the structure from the actual sequence of nucleotides.

Definition 2.3 (Folded Σ-Sequences). Let $\Sigma$ be an alphabet, i.e. a finite non-empty set. A Σ-sequence of type 0 is a word in $\Sigma^*$ and a Σ-sequence of type 1 is a pair $s = (s_1, s_2)$ of words in $\Sigma^*$.

A folded Σ-sequence is a pair $(\sigma, s)$ consisting of a structure $\sigma$ and a Σ-sequence $s$ of the same type as $\sigma = (n, P, k)$, such that

• $|s| = n$ if $\sigma$ is a 0-structure and

• $|s_1| = k$ and $|s_2| = n - k$ if it is a 1-structure.

For RNA molecules the standard alphabet is $\Sigma_{RNA} = \{A, C, G, U\}$. Nonetheless it is useful to use this general definition, because in the context of alignments bases may be deleted or inserted, resulting in "empty" bases, which correspond to the fifth letter "∘".
Figure 5. Compositions along an unpaired base and a pairing

2.2. The Compositions. We will especially consider structures constructed from smaller ones by two basic operations, called compositions. The first replaces an unpaired base of an arbitrary structure by a whole 0-structure. The second composition does exactly the same with a pairing and a 1-structure. More formally we use the following definition.

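Definition 2.4 (The Compositions).

(1) Let $\sigma = (n, P, k)$ be a structure and $i$ one of its unpaired bases. For each 0-structure $\tau = (m, Q)$ the structure $\sigma \circ_i \tau$ is defined by

\[ \sigma \circ_i \tau := (n + m - 1,\; P \circ_i Q,\; k') \]

with

\[ k' := \begin{cases} k & \text{if } k < i \\ k + m - 1 & \text{if } k \ge i \end{cases} \]

and $P \circ_i Q$ contains the following pairings:

• $(j_1', j_2')$ for all $(j_1, j_2) \in P$ with
\[ j_x' = \begin{cases} j_x & \text{if } j_x < i \\ j_x + m - 1 & \text{if } i < j_x \end{cases} \]

• $(j_1 + i - 1,\; j_2 + i - 1)$ for all $(j_1, j_2) \in Q$.

The operation $- \circ_i -$ is called composition along the (unpaired) base $i$.

(2) Let $\sigma = (n, P, k)$ be a structure and $(i,j)$ one of its pairings. For each 1-structure $\tau = (m, Q, l)$ the structure $\sigma \circ_{(i,j)} \tau$ is defined by

\[ \sigma \circ_{(i,j)} \tau := (n + m - 2,\; P \circ_{(i,j)} Q,\; k') \]

with

\[ k' := \begin{cases} k & \text{if } k < i \\ k + l - 1 & \text{if } i \le k < j \\ k + m - 2 & \text{if } j \le k \end{cases} \]

and $P \circ_{(i,j)} Q$ contains the following pairings:

• $(j_1', j_2')$ for all $(j_1, j_2) \in P \setminus \{(i,j)\}$ with
\[ j_x' = \begin{cases} j_x & \text{if } j_x < i \\ j_x + l - 1 & \text{if } i < j_x < j \\ j_x + m - 2 & \text{if } j < j_x \end{cases} \]

• $(j_1', j_2')$ for all $(j_1, j_2) \in Q$ with
\[ j_x' = \begin{cases} j_x + i - 1 & \text{if } j_x \le l \\ j_x + j - 2 & \text{if } l < j_x \end{cases} \]

The operation $- \circ_{(i,j)} -$ is called composition along the pairing $(i,j)$.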

By definition the compositions preserve the type of the first argument, i.e. if it 
is a 0/1-structure, then the resulting structure again has type 0/1. 
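The composition along an unpaired base can be sketched in a few lines. The sketch assumes that the $m$ bases of the inserted 0-structure occupy the positions $i, \dots, i+m-1$ of the result and that a gap directly after base $i$ moves behind the inserted bases; the function name `compose_base` and the tuple encoding of structures are hypothetical.

```python
def compose_base(sigma, i, tau):
    """Replace the unpaired base i of sigma by the 0-structure tau.

    sigma = (n, P, k) with k = None for a 0-structure (an assumption of this
    sketch); tau = (m, Q). Bases are numbered from 1.
    """
    n, P, k = sigma
    m, Q = tau
    if any(i in p for p in P):
        raise ValueError("base i must be unpaired")

    def shift(b):
        # bases behind i move by the m - 1 additional bases
        return b if b < i else b + m - 1

    pairs = {(shift(a), shift(b)) for a, b in P}
    # the bases of tau are placed at positions i .. i + m - 1
    pairs |= {(a + i - 1, b + i - 1) for a, b in Q}
    k2 = None if k is None else (k if k < i else k + m - 1)
    return (n + m - 1, pairs, k2)
```

For example, composing a single pairing around one free base, (3, {(1, 3)}), along base 2 with the 0-structure (2, {(1, 2)}) yields the nested structure (4, {(1, 4), (2, 3)}).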



Figure 6. The generators (labels recoverable from the figure: concat; of type 1: disconn, lembed, lconcat, rconcat, linsert, rinsert, rwrap, nest)



Obviously composition with an identity $id_k$ has no effect and composition with the empty structures $empty_k$ deletes the structural element along which it is composed.

Since the composition along an unpaired base or a pairing does not affect the remaining structural elements, we may compose simultaneously along all of them. This is denoted by $\sigma \circ (\tau_1, \dots, \tau_l)$, where $\tau_i$ is composed with $\sigma$ along the $i$-th structural element, ordered by $<_\sigma$.

As already mentioned, the compositions provide means to build larger structures from smaller ones. To get an efficient description, one needs to use a set of basic structures, called generators, and to restrict to decomposable structures, i.e. those which are compositions of generators, reducing the search space.



Definition 2.5 (Decomposable structure). Let $\Gamma$ be a finite set of 0- and 1-structures. Its elements are called generators. A structure $\sigma$ is $\Gamma$-decomposable if

(1) it is an identity ($id_0$ or $id_1$),

(2) or there exists a generator $\tau \in \Gamma$ and $\Gamma$-decomposable structures $\sigma_i$ for $1 \le i \le l$, such that $\sigma = \tau \circ (\sigma_1, \dots, \sigma_l)$.



For biological purposes we suggest the generators given in Fig. 6. In [Jau04] experiments confirmed that all pseudoknots in PseudoBase [pSS] are decomposable with respect to these generators. This is a strong indication that indecomposable structures are biologically less relevant.

The generator disconn allows the construction of disconnected 1-structures, consisting of two 0-structures.

Observe that all nested structures can be described as compositions of concat, loop, lconcat, rconcat and nest. Nonetheless, not all structures can be built from the generators in $\Gamma$ (e.g. the structure shown in Fig. 7).



3. Alignments 

One way to measure the similarity between two folded sequences is the calculation of the edit distance, i.e. the minimal number (or score) of allowed (usually reversible) operations needed to construct one sequence from the other. In the unstructured case this distance is precisely the score of an alignment of the two sequences ([Gus97]). This also holds for folded sequences, as long as one restricts to certain edit operations, which are:

(1) base-replacement or base-mismatch, replacing the character at one unpaired 
base with a different one 

(2) base-deletions, removing an unpaired base 

(3) base-insertions, adding an unpaired base 

(4) pair-replacement, replacing the characters at the ends of a pairing and 
changing at least one of them 

(5) pair-deletion, removing both ends of a pairing

(6) pair-insertion, adding a pairing 

For the representation of inserted and deleted characters we use blanks ∘. For an arbitrary alphabet $\Sigma$ let $\bar\Sigma$ be the disjoint union of $\Sigma$ with the blank $\{\circ\}$. (Graphically we represent blanks by empty circles, while non-blanks, i.e. characters, are represented by discs.)

For an arbitrary $\bar\Sigma$-sequence $s$ the Σ-sequence $\pi(s)$ is obtained by removing all occurrences of the blank from $s$. In the same way the folded Σ-sequence $\pi(\sigma, s)$ may be constructed from a folded $\bar\Sigma$-sequence $(\sigma, s)$, i.e. all bases of $\sigma$ associated with a blank are removed. If at least one base of a pairing is associated with a blank, the whole pairing is deleted.
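The projection $\pi$ can be sketched for folded sequences of type 0 as follows. The name `project`, the use of the character "o" for the blank and the reading that deleting a pairing removes both of its bases are assumptions of this sketch.

```python
BLANK = "o"  # stand-in for the blank symbol (assumption of this sketch)


def project(n, pairings, t):
    """Remove all blank bases from the folded sequence ((n, pairings), t)."""
    assert len(t) == n
    drop = {b for b in range(1, n + 1) if t[b - 1] == BLANK}
    for i, j in pairings:
        if i in drop or j in drop:
            drop.update((i, j))  # a pairing touching a blank is deleted as a whole
    keep = [b for b in range(1, n + 1) if b not in drop]
    renumber = {b: r for r, b in enumerate(keep, start=1)}
    new_pairs = {(renumber[i], renumber[j])
                 for i, j in pairings if i in renumber and j in renumber}
    return len(keep), new_pairs, "".join(t[b - 1] for b in keep)
```

Applied to the structure (5, {(1, 5), (2, 4)}) with the sequence "GoAoC", the inner pairing is removed and the result is (3, {(1, 3)}, "GAC").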

A (structural) alignment is the description of two folded sequences obtained from 
each other by a sequence of basic edit operations, without remembering them in 
detail. But in fact a possible edit sequence between both structures may easily be 
constructed from an alignment, and vice versa. 




Figure 7. A non-decomposable structure 




Figure 8. The graphical representation of the edit operations: base operations, arc-match/arc-mismatch and arc-deletion/arc-insertion
Definition 3.1 (Alignment). Let $(\sigma_k, s_k) = ((n_k, P_k), s_k)$, $k = 1,2$, be two folded sequences of same type. A (structural) alignment $(\sigma, t_1, t_2)$ between $(\sigma_1, s_1)$ and $(\sigma_2, s_2)$ is a structure $\sigma$ of same type, together with two $\bar\Sigma$-sequences $t_1, t_2$, such that

• $(\sigma_k, s_k) = \pi(\sigma, t_k)$ for $k = 1,2$ and

• $(t_k[i], t_k[j]) \in \Sigma \times \Sigma \cup \{(\circ,\circ)\}$ for $(i,j) \in \sigma$ and $k = 1,2$.

More intuitively, an alignment consists of a structure $\sigma$, such that the structures $\sigma_1$ and $\sigma_2$ can be obtained by removing bases and pairings. The bases and pairings which have to be removed are indicated by the position of the blanks in the sequences. The second condition ensures that either both bases in a pairing are associated with blanks, or none. Two examples of alignments are shown in Fig. 9.

Instead of counting the minimal number of edit operations needed for the transformation of one sequence into another, we assign a non-negative score $S(\sigma, t_1, t_2)$ to an alignment and try to minimize it. The score is the sum of scores of the structural elements of $\sigma$, depending on both the corresponding bases of the aligned structures and the underlying structural elements. For each unpaired base of the alignment we use scores

\[ S\binom{x}{y}, \qquad S\binom{x}{\circ}, \qquad S\binom{\circ}{y}, \]

where $x$ and $y$ are arbitrary characters. For pairings we have scores of the following forms:

\[ S\binom{x_1, x_2}{y_1, y_2}, \qquad S\binom{x_1, x_2}{\circ, \circ}, \qquad S\binom{\circ, \circ}{y_1, y_2}. \]

The score of an alignment $(\sigma, t_1, t_2)$ is defined by

\[ S(\sigma, t_1, t_2) := \sum_{i \in \sigma} S\binom{t_1[i]}{t_2[i]} + \sum_{(i,j) \in \sigma} S\binom{t_1[i], t_1[j]}{t_2[i], t_2[j]}. \]

In addition to the scores being non-negative, we require a condition on the scores $S\binom{x,y}{x,y}$ and $S\binom{\circ,\circ}{\circ,\circ}$.

The score of a minimum alignment of two folded sequences is

\[ S((\sigma_1, s_1), (\sigma_2, s_2)) := \min\{\, S(\sigma, t_1, t_2) \mid \pi(\sigma, t_k) = (\sigma_k, s_k),\ k = 1,2 \,\}. \]
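To make the score of a given alignment concrete, the following sketch evaluates the sum over unpaired bases and pairings for an alignment of type 0. The concrete weights in `base_score` and `pair_score` are purely illustrative assumptions; the paper leaves the scores as parameters.

```python
BLANK = "o"  # stand-in for the blank symbol (assumption of this sketch)


def base_score(x, y):
    """Illustrative weights: 0 for a match, 1 for a mismatch or an indel."""
    return 0 if x == y else 1


def pair_score(x1, x2, y1, y2):
    """Illustrative weights: 0 match, 1 mismatch, 2 pair insertion/deletion."""
    if (x1, x2) == (y1, y2):
        return 0
    if (x1, x2) == (BLANK, BLANK) or (y1, y2) == (BLANK, BLANK):
        return 2
    return 1


def alignment_score(n, pairings, t1, t2):
    """Score of the alignment ((n, pairings), t1, t2) of two 0-sequences."""
    paired = {b for p in pairings for b in p}
    total = sum(base_score(t1[b - 1], t2[b - 1])
                for b in range(1, n + 1) if b not in paired)
    total += sum(pair_score(t1[i - 1], t1[j - 1], t2[i - 1], t2[j - 1])
                 for i, j in pairings)
    return total
```

With the structure (5, {(1, 5), (2, 4)}) and the rows "GCAGC" and "GoAoC", only the deleted inner pairing contributes, giving a score of 2 under these illustrative weights.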

3.1. Semi-decomposable Alignments. As shown in [ZWM02] the general problem of finding the score of a minimum alignment is NP-complete. But we are going to describe an algorithm calculating the exact minimum alignment of a decomposable folded sequence and an arbitrary one, using the fact that an arbitrary alignment of these is semi-decomposable, as defined below.





Figure 9. Two structural alignments of two folded sequences 
In the following $\Gamma$ is a finite set of generators including those in Fig. 6. The notions of semi-decomposability and decomposability will always refer to this set.

Let $P$ be a set of pairings in $\sigma$; then $\sigma \setminus P$ is the structure obtained by removing all pairings in $P$ (including their bases). If $(\sigma, s)$ is a folded sequence we define $(\sigma, s) \setminus P := (\sigma \setminus P, s \setminus P)$, where $s \setminus P$ is constructed from $s$ by removing all letters associated with the bases of pairings in $P$.

As we will see, an alignment of a decomposable folded sequence with an arbitrary one is semi-decomposable in the following sense. This observation and Lemma 3.3 make it possible to calculate the minimal alignment by decomposition.

Definition 3.2 (Semi-decomposable). An alignment $(\sigma, t_1, t_2)$ of two folded sequences $(\sigma_1, s_1)$ and $(\sigma_2, s_2)$ is called semi-decomposable if there exists a set $P$ of pairings in $\sigma$, such that

(1) $(t_k[i], t_k[j]) = (\circ, \circ)$ for one $k \in \{1,2\}$ and each $(i,j) \in P$,

(2) and the structure $\sigma \setminus P$ is decomposable.

For an arbitrary folded 0-sequence $(\sigma, s)$ of length $n$ and $1 \le i \le j \le n$ let $\sigma[i,j]$ be the structure obtained from $\sigma$ by removing all unpaired bases outside the interval $[i,j]$ and all pairings with at least one end outside of it. Furthermore we define $(\sigma, s)[i,j] := (\sigma[i,j], s')$, where $s'$ is obtained from $s$ by deletion of the letters assigned to the deleted bases. Similarly the 1-structure $(\sigma, s)[i_1, j_1; i_2, j_2]$ is defined for $i_1 \le j_1 < i_2 \le j_2$.

Let $\tau$ be an arbitrary structure with $m$ bases and $\sigma$ one of same type with $n$ bases. A $\tau$-splitting of $\sigma$ is a partition of the interval $[1,n]$ into $m$ subintervals $I^1, \dots, I^m$ with $I^l = [i^l, j^l]$, which respect the gaps if $\tau$ and $\sigma$ have type 1, i.e.

• $i^1 = 1$, $j^k + 1 = i^{k+1}$ for $1 \le k < m$ and $j^m = n$,

• $j^l = k$ if the gap in $\tau$ is between $l$ and $l+1$ and the gap in $\sigma$ between $k$ and $k+1$.

A pairing $(i,j)$ of $\sigma$ is called incompatible with the splitting if $i \in I^{i'}$ and $j \in I^{j'}$ with $i' \ne j'$ and $(i',j')$ is not a pairing in $\tau$, i.e. the two bases of the pairing lie in two different intervals of the splitting which aren't paired in $\tau$. A $\tau$-splitting of $\sigma$ is called proper if the induced splitting of $\sigma \setminus P$ contains no empty interval, where $P$ is the set of incompatible pairings. In other words a $\tau$-splitting is proper if and only if each interval contains at least one unpaired base or one end of a compatible pairing.
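The two notions can be sketched directly from their definitions. In the sketch below a splitting is given as a list of inclusive 1-based intervals `(lo, hi)` and the generator only through its set of pairings; the function names are hypothetical.

```python
def incompatible_pairings(pairings, intervals, tau_pairings):
    """Pairings whose ends lie in two intervals that are not paired in tau."""
    def index_of(base):
        for idx, (lo, hi) in enumerate(intervals, start=1):
            if lo <= base <= hi:
                return idx
        raise ValueError("base not covered by the splitting")

    bad = set()
    for i, j in pairings:
        a, b = index_of(i), index_of(j)
        if a != b and (a, b) not in tau_pairings:
            bad.add((i, j))
    return bad


def is_proper(pairings, intervals, tau_pairings):
    """Each interval keeps a base after removing the incompatible pairings."""
    removed = {b for p in incompatible_pairings(pairings, intervals, tau_pairings)
               for b in p}
    return all(any(b not in removed for b in range(lo, hi + 1))
               for lo, hi in intervals)
```

For the structure of Fig. 2 and the splitting [1,2], [3,4], [5,6], [7,8], every pairing is compatible with a generator that pairs the intervals (1, 3) and (2, 4), while all four pairings are incompatible with one that pairs (1, 4) and (2, 3).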

The notion of proper splittings allows an equivalent description of semi-decomposable structures.

Lemma 3.3. An alignment $(\sigma, t_1, t_2)$ of $(\sigma_1, s_1)$ and $(\sigma_2, s_2)$ is semi-decomposable if and only if either $\sigma$ is an identity, a generator, or if

(1) there exists a generator $\tau$ with $m$ bases and $l$ structural elements,

(2) proper $\tau$-splittings $I_k^1, \dots, I_k^m$ of $\sigma_k$ for $k = 1,2$,

(3) for each structural element $x$ of $\tau$ there exists a semi-decomposable alignment $(\sigma^x, t_1^x, t_2^x)$ of $(\sigma_1, s_1)[I_1^i]$ and $(\sigma_2, s_2)[I_2^i]$ if $x = i \in \tau$, or of $(\sigma_1, s_1)[I_1^i; I_1^j]$ and $(\sigma_2, s_2)[I_2^i; I_2^j]$ if $x = (i,j) \in \tau$,

(4) and $\sigma_k \setminus P_k = \tau \circ (\pi(\sigma^1, t_k^1), \dots, \pi(\sigma^l, t_k^l))$ for $k = 1,2$, where $P_k$ is the set of incompatible pairings of $\sigma_k$.

Proof. If $\sigma$ is an identity or a generator the lemma obviously holds.

Suppose that the alignment $(\sigma, t_1, t_2)$ is semi-decomposable. If $\sigma$ is only an unpaired base, then it obviously is a decomposition and $P' = \emptyset$. The same holds if $\sigma$ has only a single pairing. Now assume that $P$ is the set of pairings from the definition of semi-decomposability, i.e. $\sigma \setminus P = \tau \circ (\sigma^1, \dots, \sigma^m)$ for a generator $\tau$ and decomposable structures $\sigma^1, \dots, \sigma^m$. Then one can choose a $\tau$-splitting of $\sigma$, such that the induced splitting of $\sigma \setminus P$ is the one given by the decomposition (simply add the bases of the deleted pairings to appropriate intervals). Since the generators do not allow the deletion of bases or pairings, this splitting is proper, and some of the pairings in $P$ are compatible with it and some aren't. Let $P' \subseteq P$ be the subset of the latter ones. The compatible pairings can be added to the $\sigma^i$, resulting in structures $\hat\sigma^1, \dots, \hat\sigma^m$. By induction these induce semi-decomposable subalignments, because they contain fewer structural elements than $\sigma$ and removal of the added pairings results in decomposable structures. Furthermore we have $\sigma \setminus P' = \tau \circ (\hat\sigma^1, \dots, \hat\sigma^m)$.

Now assume that there exists a generator $\tau$ and a proper $\tau$-splitting as stated in the lemma. Since the subalignments are semi-decomposable and contain fewer structural elements than $\sigma$, there exists a set $P^x$ of pairings in $\sigma^x$ for each structural element $x \in \tau$, such that $\sigma^x \setminus P^x$ is decomposable and each pairing in $P^x$ is assigned to blanks in at least one sequence. If $P$ is the set of pairings incompatible with the splitting, this leads to

\[ \sigma \setminus \Big( P \cup \bigcup_x P^x \Big) = \tau \circ (\sigma^1 \setminus P^1, \dots, \sigma^l \setminus P^l), \]

proving the semi-decomposability of $(\sigma, t_1, t_2)$. □

Corollary 3.4. Any alignment $(\sigma, t_1, t_2)$ of a decomposable folded sequence $(\sigma_1, s_1)$ and an arbitrary folded sequence $(\sigma_2, s_2)$ is semi-decomposable.

Proof. Choose $P'$ as the set of pairings $(i,j)$ in $\sigma$ matched by blanks, i.e. $(t_2[i], t_2[j]) = (\circ, \circ)$. Then $\sigma \setminus P'$ is exactly $\sigma_1$ with additional unpaired bases. Since unpaired bases may be added at any position by composition with generators, $\sigma \setminus P'$ is decomposable. □

As a consequence of Corollary 3.4 it is sufficient to find the minimum semi-decomposable alignment if at least one of the sequences is decomposable. An arbitrary semi-decomposable alignment can be constructed in the following way.

(1) Choose a generator $\tau$.

(2) Choose two proper $\tau$-splittings of the structures.

(3) Find all incompatible pairings.

(4) Align the subsequences (without incompatible pairings) induced by the splittings.

Then the score of the alignment is the sum of the scores of the subalignments and the scores of incompatible pairings matched against blanks. This approach leads to the algorithm described in detail in the following section.

3.2. The Algorithm. The minimum alignment of two folded sequences is calculated using dynamic programming. We use two arrays, indexed by intervals. For two folded 0-sequences $(\sigma_1, s_1)$ and $(\sigma_2, s_2)$ the value $S_0[I_1; I_2]$ is the score of a minimal alignment of the 0-structures $(\sigma_1, s_1)[I_1]$ and $(\sigma_2, s_2)[I_2]$. Similarly $S_1[I_1, J_1; I_2, J_2]$ is the score of a minimal alignment of the two 1-sequences $(\sigma_1, s_1)[I_1; J_1]$ and $(\sigma_2, s_2)[I_2; J_2]$. The values $R_k[I; J]$ are the sums of all weights of pairings $(i,j)$ in $(\sigma_k, s_k)$ such that one end is in $I$ and the other in $J$, i.e.

\[ R_k[I; J] := \sum_{(i,j) \in \sigma_k,\ i \in I,\ j \in J} S\binom{s_k[i], s_k[j]}{\circ, \circ}. \]

$W(\sigma, s)$ is the score of the sequence $(\sigma, s)$ (aligned with blanks), i.e.

\[ W(\sigma, s) := \sum_{i \in \sigma} S\binom{s[i]}{\circ} + \sum_{(i,j) \in \sigma} S\binom{s[i], s[j]}{\circ, \circ}. \]

For shorter notation we write:

\[ W_k[I] := W((\sigma_k, s_k)[I]) \quad \text{and} \quad W_k[I; J] := W((\sigma_k, s_k)[I; J]). \]
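The auxiliary values can be computed directly from their definitions. The following sketch does so for a single interval or interval pair; the function names `R` and `W` mirror the notation above, while the cost callbacks `base_del` and `pair_del` (scores of a base or pairing aligned with blanks) are hypothetical parameters of this example. It assumes that a paired base whose partner lies outside the interval is removed together with its pairing.

```python
def R(pairings, s, I, J, pair_del):
    """Sum of the deletion scores of pairings with one end in I, the other in J."""
    return sum(pair_del(s[i - 1], s[j - 1]) for i, j in pairings
               if I[0] <= i <= I[1] and J[0] <= j <= J[1])


def W(pairings, s, I, base_del, pair_del):
    """Score of the restriction of (sigma, s) to I, aligned with blanks only."""
    lo, hi = I
    paired = {b for p in pairings for b in p}
    # unpaired bases inside the interval
    total = sum(base_del(s[b - 1]) for b in range(lo, hi + 1) if b not in paired)
    # pairings with both ends inside the interval
    total += sum(pair_del(s[i - 1], s[j - 1]) for i, j in pairings
                 if lo <= i and j <= hi)
    return total
```

A direct evaluation like this costs time linear in the number of pairings per entry; tabulating all intervals in advance, as the algorithm requires, trades this for additional space.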

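The entries of the arrays $S_0$ and $S_1$ can be calculated recursively using scores for shorter intervals. The recursion stops if all intervals consist of only one base, i.e. $I_k = [i_k, i_k]$ and $J_k = [j_k, j_k]$. Then we have:

\[ (1) \qquad S_0[I_1; I_2] = \begin{cases} \min\left\{ S\binom{s_1[i_1]}{s_2[i_2]},\ W_1[I_1] + W_2[I_2] \right\} & \text{if } i_1 \text{ and } i_2 \text{ are unpaired} \\ W_1[I_1] + W_2[I_2] & \text{otherwise} \end{cases} \]

\[ (2) \qquad S_1[I_1, J_1; I_2, J_2] = \begin{cases} \min\left\{ S\binom{s_1[i_1], s_1[j_1]}{s_2[i_2], s_2[j_2]},\ W_1[I_1; J_1] + W_2[I_2; J_2] \right\} & (i_1, j_1) \text{ and } (i_2, j_2) \text{ are pairings} \\ W_1[I_1; J_1] + W_2[I_2] + W_2[J_2] & \text{only } (i_1, j_1) \text{ is a pairing} \\ W_2[I_2; J_2] + W_1[I_1] + W_1[J_1] & \text{only } (i_2, j_2) \text{ is a pairing} \\ W_1[I_1] + W_1[J_1] + W_2[I_2] + W_2[J_2] & \text{otherwise} \end{cases} \]

In general, we have to check every generator $\tau$ of type 0 and every pair $I_k = I_k^1 \dots I_k^m$, $k = 1,2$, of proper $\tau$-splittings of $\sigma_k[I_k]$. The score of this decomposition is

\[ (3) \qquad X_0(\tau, I_1^1, \dots, I_2^m) := \sum_{i \in \tau} S_0[I_1^i; I_2^i] + \sum_{(i,j) \in \tau} S_1[I_1^i, I_1^j; I_2^i, I_2^j] + \sum_{(i,j) \notin \tau} \left( R_1[I_1^i; I_1^j] + R_2[I_2^i; I_2^j] \right). \]

The first two sums are the scores contributed by subalignments induced by the unpaired bases and pairings in $\tau$. The third sum, taken over all pairs of bases $i < j$ of $\tau$ that are not paired in $\tau$, is the score of incompatible pairings.

Since the splittings are proper, the intervals $I_k^l$ aren't empty and therefore are shorter than $I_k$. Hence $X_0(\tau, I_1^1, \dots, I_2^m)$ can be calculated from the scores of subalignments of shorter intervals.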
The score $S_0[I_1; I_2]$ is the minimum over all generators of type 0 and all proper splittings. For concat and loop this leads to

\[ S_0[I_1; I_2] = \min \begin{cases} S_0[I_1^1; I_2^1] + S_0[I_1^2; I_2^2] + R_1[I_1^1; I_1^2] + R_2[I_2^1; I_2^2] & (\tau = \text{concat}) \\ S_1[I_1^1, I_1^2; I_2^1, I_2^2] & (\tau = \text{loop}) \end{cases} \]

where $I_k = I_k^1 I_k^2$, $k = 1,2$, are decompositions of the intervals.

Similarly, $S_1$ can be calculated as the minimum score over all decompositions using a generator $\tau$ of type 1. In general, one obtains the following score for $S_1[I_1, J_1; I_2, J_2]$ if $\tau$ has non-empty legs with $m$ and $l$ bases, $m, l \ge 1$:

\[ (4) \qquad X_1(\tau, I_1^1, \dots, I_2^{m+l}) := \sum_{i \in \tau} S_0[I_1^i; I_2^i] + \sum_{(i,j) \in \tau} S_1[I_1^i, I_1^j; I_2^i, I_2^j] + \sum_{(i,j) \notin \tau} \left( R_1[I_1^i; I_1^j] + R_2[I_2^i; I_2^j] \right) \]

where $I_k = I_k^1 \dots I_k^m$ and $J_k = I_k^{m+1} \dots I_k^{m+l}$ for $k = 1,2$.

Again the score $S_1[I_1, J_1; I_2, J_2]$ is the minimum over all sums for all generators of type 1 and all appropriate splittings. For some of the generators given in Fig. 6 this results in the formulas given in Tab. 1.

Algorithm 1: Calculation of the score of a minimal structural alignment
Input: Two folded sequences $(\sigma_k, s_k)$, $k = 1,2$, of type 0; one of them has to be decomposable
Output: The score of a minimum structural alignment between both
begin
    $n_1 \leftarrow$ number of bases in $\sigma_1$;
    $n_2 \leftarrow$ number of bases in $\sigma_2$;
    Mark all entries $S_0[I_1; I_2]$ and $S_1[I_1, J_1; I_2, J_2]$ as undefined;
    return $S_0((\sigma_1, s_1), (\sigma_2, s_2), (1, n_1), (1, n_2))$;
end

Function $S_0(\mathcal{S}_1, \mathcal{S}_2, I_1, I_2)$
Input: Two folded 0-sequences $\mathcal{S}_k = (\sigma_k, s_k)$ and two intervals $I_k$, $k = 1,2$
Output: The score of a minimum structural alignment of $(\sigma_1, s_1)[I_1]$ and $(\sigma_2, s_2)[I_2]$
begin
    if $S_0[I_1; I_2]$ is defined then return $S_0[I_1; I_2]$;
    forall generators $\tau$ of type 0 do
        forall proper $\tau$-splittings $I_k = I_k^1 \dots I_k^m$, $k = 1,2$ do
            $x \leftarrow X_0(\tau, I_1^1, \dots, I_2^m)$ as given by Eqn. (1) or (3);
            if $S_0[I_1; I_2]$ is undefined or $x < S_0[I_1; I_2]$ then $S_0[I_1; I_2] \leftarrow x$;
        end
    end
    return $S_0[I_1; I_2]$;
end
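\[ S_1[I_1, J_1; I_2, J_2] = \min \begin{cases}
S_0[I_1; I_2] + S_0[J_1; J_2] + R_1[I_1; J_1] + R_2[I_2; J_2] & (\tau = \text{disconn}) \\[6pt]
S_0[I_1^1; I_2^1] + S_1[I_1^2, J_1; I_2^2, J_2] \\
\quad + R_1[I_1^1; I_1^2] + R_1[I_1^1; J_1] \\
\quad + R_2[I_2^1; I_2^2] + R_2[I_2^1; J_2] & (\tau = \text{lconcat}),\ \text{for } I_k = I_k^1 I_k^2 \\[6pt]
S_1[I_1^1, I_1^3; I_2^1, I_2^3] + S_1[I_1^2, J_1; I_2^2, J_2] \\
\quad + R_1[I_1^1; I_1^2] + R_1[I_1^2; I_1^3] + R_1[I_1^1; J_1] + R_1[I_1^3; J_1] \\
\quad + R_2[I_2^1; I_2^2] + R_2[I_2^2; I_2^3] + R_2[I_2^1; J_2] + R_2[I_2^3; J_2] & (\tau = \text{lwrap}),\ \text{for } I_k = I_k^1 I_k^2 I_k^3 \\[6pt]
S_1[I_1^1, J_1^2; I_2^1, J_2^2] + S_1[I_1^2, J_1^1; I_2^2, J_2^1] \\
\quad + R_1[I_1^1; I_1^2] + R_1[I_1^1; J_1^1] + R_1[I_1^2; J_1^2] + R_1[J_1^1; J_1^2] \\
\quad + R_2[I_2^1; I_2^2] + R_2[I_2^1; J_2^1] + R_2[I_2^2; J_2^2] + R_2[J_2^1; J_2^2] & (\tau = \text{nest}),\ \text{for } I_k = I_k^1 I_k^2,\ J_k = J_k^1 J_k^2 \\[6pt]
S_1[I_1^1, J_1^1; I_2^1, J_2^1] + S_1[I_1^2, J_1^2; I_2^2, J_2^2] \\
\quad + R_1[I_1^1; I_1^2] + R_1[I_1^1; J_1^2] + R_1[I_1^2; J_1^1] + R_1[J_1^1; J_1^2] \\
\quad + R_2[I_2^1; I_2^2] + R_2[I_2^1; J_2^2] + R_2[I_2^2; J_2^1] + R_2[J_2^1; J_2^2] & (\tau = \text{cross}),\ \text{for } I_k = I_k^1 I_k^2,\ J_k = J_k^1 J_k^2
\end{cases} \]

Table 1. $S_1$ for the generators disconn, lconcat, lwrap, nest and cross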


Function $S_1(\mathcal{S}_1, \mathcal{S}_2, I_1, J_1, I_2, J_2)$
Input: Two folded 0-sequences $\mathcal{S}_k = (\sigma_k, s_k)$ and four intervals $I_k, J_k$, $k = 1,2$
Output: The score of a minimum structural alignment of $(\sigma_1, s_1)[I_1, J_1]$ and $(\sigma_2, s_2)[I_2, J_2]$
begin
    if $S_1[I_1, J_1; I_2, J_2]$ is defined then return $S_1[I_1, J_1; I_2, J_2]$;
    forall generators $\tau$ of type 1 do
        forall proper $\tau$-splittings $I_k = I_k^1 \dots I_k^m$, $J_k = I_k^{m+1} \dots I_k^{m+l}$ do
            $x \leftarrow X_1(\tau, I_1^1, \dots, I_2^{m+l})$ as given by Eqn. (2) or (4);
            if $S_1[I_1, J_1; I_2, J_2]$ is undefined or $x < S_1[I_1, J_1; I_2, J_2]$ then $S_1[I_1, J_1; I_2, J_2] \leftarrow x$;
        end
    end
    return $S_1[I_1, J_1; I_2, J_2]$;
end
Theorem 3.5. Let $\Gamma$ be a set of generators, i.e. a finite set of 0- and 1-structures, including those in Fig. 6, and $m$ the maximal number of bases in any leg of any generator $\tau \in \Gamma$. The score of a minimum alignment of a $\Gamma$-decomposable folded sequence with $n_1$ bases, and an arbitrary one with $n_2$ bases, can be calculated in $O(n_1^{2m+2} n_2^{2m+2})$ time and $O(n_1^4 n_2^4)$ space.

Proof. There are $O(n_k^2)$ weights $W((\sigma_k, s_k)[I])$, which can be calculated in $O(n_k)$ each, leading to $O(n_1^3 + n_2^3)$ time and $O(n_1^2 + n_2^2)$ space. Similarly the $R_k[I; J]$ need $O(n_1^4 + n_2^4)$ space and polynomial total time.

The arrays 5*0 and 5*1 have 0{nln'^) and 0{n\n\) entries. We assume that the 
calculation of Xq (t, . . . ) and Xi (r, . . . ) using equations Q and Q require constant 
time (this is the case if the equations for each generator r are directly programmed 
and not dynamically evaluated). Furthermore already known scores are stored such 
that each score has to be calculated at most once. 

Since the $R_k[I; J]$ and the weights $W$ are already known, the time required for
each splitting of the intervals is constant, and the check whether a splitting is proper
takes $O(n_1 + n_2)$ time. Hence it is sufficient to count the splittings for each entry.
For $S_0$ each interval is split into at most $m$ parts, resulting in $O(n_1^{m-1} n_2^{m-1})$ cases
and $O(n_1^{m+1} n_2^{m+1})$ time. For $S_1$ each interval is split into at most $m$ subintervals,
leading to $O(n_1^{2m-2} n_2^{2m-2})$ cases and $O(n_1^{2m+2} n_2^{2m+2})$ time for $S_1$. □
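
The counting step of the proof can be checked on small inputs. The sketch below enumerates all ways to cut an interval into a given number of consecutive parts and counts the cuts into at most $m$ parts. It assumes that parts are non-empty and consecutive; the paper's notion of a proper splitting may impose further restrictions, so the count is only an upper bound in that case. The function names are hypothetical. For fixed $m$ the count is a polynomial of degree $m - 1$ in the interval length, which matches the $n^{m-1}$ cases per interval used in the proof.

```python
from itertools import combinations
from math import comb
from typing import Iterator, List, Tuple

Interval = Tuple[int, int]  # inclusive bounds (lo, hi)


def splittings(interval: Interval, parts: int) -> Iterator[List[Interval]]:
    """Yield every split of `interval` into `parts` consecutive non-empty intervals."""
    lo, hi = interval
    # A cut at position c ends one part at c and starts the next part at c + 1.
    for cuts in combinations(range(lo, hi), parts - 1):
        bounds = [lo - 1, *cuts, hi]
        yield [(bounds[k] + 1, bounds[k + 1]) for k in range(parts)]


def count_splittings(n: int, m: int) -> int:
    """Number of splits of an interval with n positions into at most m parts."""
    return sum(comb(n - 1, p - 1) for p in range(1, min(m, n) + 1))
```

For example, `list(splittings((1, 4), 2))` lists the three ways to cut the interval $[1, 4]$ into two parts.
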

Corollary 3.6. For the set F of generators given in Fig. the score of a minimum 
alignment of a decomposable folded sequence with ni bases, and an arbitrary one 
with n2 bases, can be calculated in OirJ-yrJ^ time and 0(ri|ri|) space. 

3.3. The Approximation of arbitrary alignments. Obviously the calculations
can be made for two arbitrary folded sequences, resulting in a minimum
semi-decomposable alignment. As we have seen, this leads to a globally minimum
alignment if at least one of them is decomposable. But what can be said about the
calculated score if neither structure is decomposable?

Let $S_{\mathrm{decomp}} = S_0[(1, n_1); (1, n_2)]$ be the score calculated by our algorithm. Then
we have the following result regarding the quality of the approximation.

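Theorem 3.7. If there exists a constant $c \geq 1$ such that

$$ c \cdot s\binom{x, y}{x', y'} \;\geq\; s\binom{x, y}{\circ, \circ} + s\binom{\circ, \circ}{x', y'} $$

for all $x, y, x', y' \in \Sigma$ with $x \neq x'$ or $y \neq y'$, then

$$ S_{\mathrm{decomp}}((\sigma_1, s_1), (\sigma_2, s_2)) \;\leq\; c \cdot S((\sigma_1, s_1), (\sigma_2, s_2)) $$

for all folded sequences $(\sigma_k, s_k)$, $k = 1, 2$.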

Proof. Let $(\sigma, t_1, t_2)$ be a minimum alignment. We construct a semi-decomposable
alignment by replacing every mismatched pairing by an insertion and a deletion.
This procedure allows us to control the increase of the score and leads to
the stated inequality. It also ensures that the constructed alignment is
semi-decomposable, since its decomposable core is obtained from the initial alignment
by the deletion of pairings.
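Let $M = \{(i, j) \in \sigma \mid (t_1[i] \neq t_2[i] \text{ or } t_1[j] \neq t_2[j]) \text{ and } t_k[i], t_k[j] \neq \circ \text{ for } k = 1, 2\}$
be the set of mismatched pairings. Let $A$ denote the total score of the positions that are
not part of a pairing in $\sigma$, and $B$ the total score of the pairings in $\sigma \setminus M$. Then we have:

$$
\begin{aligned}
S((\sigma_1, s_1), (\sigma_2, s_2)) &= S(\sigma, t_1, t_2) \\
&= A + B + \sum_{(i,j) \in M} s\binom{t_1[i], t_1[j]}{t_2[i], t_2[j]} \\
&\geq A + B + \sum_{(i,j) \in M} \frac{1}{c} \left( s\binom{t_1[i], t_1[j]}{\circ, \circ} + s\binom{\circ, \circ}{t_2[i], t_2[j]} \right) \\
&\geq \frac{1}{c} \left( A + B + \sum_{(i,j) \in M} \left( s\binom{t_1[i], t_1[j]}{\circ, \circ} + s\binom{\circ, \circ}{t_2[i], t_2[j]} \right) \right) \\
&\geq \frac{1}{c} \, S_{\mathrm{decomp}}((\sigma_1, s_1), (\sigma_2, s_2))
\end{aligned}
$$

□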
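
The set $M$ of mismatched pairings used in the proof can be computed directly from an alignment. The sketch below assumes that the two aligned sequences are given as strings of equal length with 0-based positions, that pairings are index pairs, and that the gap symbol is the character `"o"`. These representation choices and the function name are assumptions made for illustration only.

```python
from typing import Iterable, Sequence, Set, Tuple

GAP = "o"  # stand-in for the gap symbol of the alignment (an assumption)


def mismatched_pairings(
    pairs: Iterable[Tuple[int, int]],
    t1: Sequence[str],
    t2: Sequence[str],
    gap: str = GAP,
) -> Set[Tuple[int, int]]:
    """Pairings whose four aligned symbols are all bases and differ in some position."""
    mismatched = set()
    for i, j in pairs:
        # All four aligned symbols must be bases, i.e. none of them is a gap.
        no_gap = all(t[p] != gap for t in (t1, t2) for p in (i, j))
        # At least one end of the pairing carries different bases.
        differs = t1[i] != t2[i] or t1[j] != t2[j]
        if no_gap and differs:
            mismatched.add((i, j))
    return mismatched
```

Each pairing returned by this function is one that the construction in the proof replaces by a deletion and an insertion.
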



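If the scores satisfy

$$ s\binom{x, y}{x', y'} \;<\; s\binom{x, y}{\circ, \circ} + s\binom{\circ, \circ}{x', y'} $$

for at least one combination of $x, y, x', y' \in \Sigma$ with $x \neq x'$ or $y \neq y'$, then one may
simply choose:

$$ c := \max \left\{ \frac{s\binom{x, y}{\circ, \circ} + s\binom{\circ, \circ}{x', y'}}{s\binom{x, y}{x', y'}} \;\middle|\; x \neq x' \text{ or } y \neq y' \right\} > 1. $$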
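
For a concrete scoring scheme this choice of $c$ can be evaluated by brute force over the alphabet. The sketch below reads the condition of Theorem 3.7 as "deleting one pairing and inserting the other costs at most $c$ times the mismatch". The function `pair_score`, the gap character `"o"` and the toy scores in the usage note are hypothetical, and the code assumes that every mismatch score is strictly positive.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

GAP = "o"  # stand-in for the gap symbol (an assumption)
Pair = Tuple[str, str]


def approximation_constant(
    alphabet: Sequence[str],
    pair_score: Callable[[Pair, Pair], float],  # hypothetical scoring function
) -> float:
    """Largest ratio (deletion + insertion) / mismatch, but never below 1."""
    worst = 1.0
    for x, y, xp, yp in product(alphabet, repeat=4):
        if x == xp and y == yp:
            continue  # identical pairings are not mismatches
        mismatch = pair_score((x, y), (xp, yp))  # assumed to be > 0
        indel = pair_score((x, y), (GAP, GAP)) + pair_score((GAP, GAP), (xp, yp))
        worst = max(worst, indel / mismatch)
    return worst
```

With toy scores that charge 1 for a mismatched pairing and 2 for deleting or inserting a pairing, the function returns 4.0. If a mismatch costs at least as much as a deletion plus an insertion, it returns 1.0, which is the situation of Corollary 3.8.
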

In particular, one obtains: 

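Corollary 3.8. $S((\sigma_1, s_1), (\sigma_2, s_2)) = S_{\mathrm{decomp}}((\sigma_1, s_1), (\sigma_2, s_2))$ for arbitrary folded
sequences $(\sigma_k, s_k)$, $k = 1, 2$, if

$$ s\binom{x, y}{x', y'} \;\geq\; s\binom{x, y}{\circ, \circ} + s\binom{\circ, \circ}{x', y'} $$

for $x \neq x'$ or $y \neq y'$.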

4. Conclusion 

We presented a new formal description of secondary RNA structures incorporating
a wide variety of pseudoknots, i.e. tertiary structures. Based on this description
it is possible to calculate the exact score of a minimum structural alignment of a
decomposable and an arbitrary structure. For other cases the algorithm provides
an approximation of guaranteed quality, depending on the chosen weights for the
underlying edit operations.
For the set of generators given in Fig. Elthe algorithm requires 0(n^^) time and
0{n^) space, where n is the number of bases in the longer sequence. Hence, further 
improvements are necessary. For example, the structures may be reduced to their
underlying stems, i.e. sequences of nested pairs, decreasing the number of nodes.
In addition, it might be possible to restrict the algorithm to special decompositions
of the intervals, which would reduce the required space. This would at least lead to an
approximation algorithm producing exact results in many cases. To what extent
and with what quality remains to be analyzed.
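
The reduction to stems mentioned above can be sketched as follows. The code assumes that a stem is a maximal run of directly stacked pairs $(i, j), (i+1, j-1), \ldots$; the text only speaks of "sequences of nested pairs", so a definition that tolerates bulges would need a different grouping rule. The function name is hypothetical.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[int, int]


def stems(pairs: Iterable[Pair]) -> List[List[Pair]]:
    """Group base pairs into maximal runs (i, j), (i + 1, j - 1), ..."""
    remaining = set(pairs)
    result: List[List[Pair]] = []
    for i, j in sorted(remaining):
        if (i - 1, j + 1) in remaining:
            continue  # not the outermost pair of its stem
        stem = []
        while (i, j) in remaining:  # walk inwards along the stacked pairs
            stem.append((i, j))
            i, j = i + 1, j - 1
        result.append(stem)
    return result
```

For the pairs $(1, 10), (2, 9), (3, 8), (5, 6)$ this yields two stems, so an alignment on stems would work with two nodes instead of four pairs.
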

On the other hand, the set of structures for which the algorithm produces exact
results can be extended by the addition of generators. This increases the
running time. But as long as only a finite number of generators is used, the algorithm
stays polynomial, even though the general problem is NP-complete. Hence the
addition of generators leads to a sequence of polynomially solvable problems (possibly)
"converging" to an NP-complete problem. It may therefore be interesting to examine
the gap between decomposable and non-decomposable structures.